Depth information based pose determination for mobile platforms, and associated systems and methods

ABSTRACT

A method includes determining a depth range where a subject is likely to appear in a current depth map based on one or more previous depth maps of the environment, filtering the current depth map based on the depth range, to generate a reference depth map, identifying a plurality of candidate regions from the reference depth map, selecting a subset of the plurality of candidate regions, determining a main region from the subset of the plurality of candidate regions, associating the main region and one or more target regions, identifying the first pose component of the subject from a collective region, identifying the second pose component of the subject from the collective region, determining one or more vectors representing a spatial relationship between the identified first pose component and the identified second pose component, and controlling a movement of a movable object based on the one or more vectors.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of application Ser. No. 16/896,979, filed on Jun. 9, 2020, which is a continuation of International Application No. PCT/CN2017/115893, filed on Dec. 13, 2017, the entire contents of both of which are incorporated herein by reference.

TECHNICAL FIELD

The present technology is generally directed to determining the pose or gesture of one or more subjects, such as one or more human users in a three-dimensional (3D) environment adjacent to a mobile platform.

BACKGROUND

The environment surrounding a mobile platform can typically be scanned or otherwise detected using one or more sensors. For example, the mobile platform can be equipped with a stereo vision system (e.g., a “stereo camera”) to sense its surrounding environment. A stereo camera is typically a type of camera with two or more lenses, each having a separate image sensor or film frame. When taking photos/videos with the two or more lenses at the same time but from different angles, the difference between the corresponding photos/videos provides a basis for calculating depth information (e.g., distance from objects in the scene to the stereo camera). As another example, the mobile platform can be equipped with one or more LiDAR sensors, which typically transmit a pulsed signal (e.g., a laser signal) outwards, detect the pulsed signal reflections, and determine depth information about the environment to facilitate object detection and/or recognition. However, inaccuracies exist in different depth-sensing technologies, which can affect various higher-level applications such as pose and/or gesture determination.

SUMMARY

The following summary is provided for the convenience of the reader and identifies several representative embodiments of the disclosed technology.

In some embodiments, a computer-implemented method for determining a pose of a subject within an environment includes identifying a plurality of candidate regions from base depth data representing the environment based, at least in part, on at least one depth connectivity criterion, determining a first region from a subset of the candidate regions as corresponding to a first pose component of the subject, and selecting one or more second regions from the subset of candidate regions to associate with the first region based, at least in part, on relative locations of the first region and the one or more second regions. The method also includes identifying the first pose component and at least one second pose component of the subject from a collective region including the first region and the one or more associated second regions, determining a spatial relationship between the identified first pose component and the identified at least one second pose component, and causing generation of a controlling command for execution by a mobile platform based, at least in part, on the determined spatial relationship.

In some embodiments, the base depth data is generated based, at least in part, on images captured by at least one stereo camera. In some embodiments, the base depth data includes a depth map calculated based, at least in part, on a disparity map and intrinsic parameters of the at least one stereo camera. In some embodiments, the method further comprises determining a depth range where the subject is likely to appear in the base depth data. In some embodiments, the plurality of candidate regions are identified within the depth range. In some embodiments, the base depth data includes at least one of unknown, invalid, or inaccurate depth information.

In some embodiments, the at least one depth connectivity criterion includes at least one of a depth threshold or a change-of-depth threshold. In some embodiments, two or more of the candidate regions are disconnected from one another due, at least in part, to unknown, invalid, or inaccurate depth information in the base depth data. In some embodiments, the method further comprises selecting the subset of the candidate regions based, at least in part, on first baseline information regarding at least one pose component of the subject. In some embodiments, the first baseline information indicates an estimated size of the at least one pose component. In some embodiments, the first baseline information is based, at least in part, on prior depth data.

In some embodiments, determining the first region is based, at least in part, on second baseline information regarding the first pose component of the subject. In some embodiments, the second baseline information indicates an estimated location of the first pose component. In some embodiments, the second baseline information is based, at least in part, on prior depth data. In some embodiments, selecting one or more second regions from the subset of candidate regions to associate with the first region is based, at least in part, on non-depth information regarding the environment. In some embodiments, the non-depth information includes two-dimensional image data that corresponds to the base depth data.

In some embodiments, selecting the one or more second regions to associate with the first region comprises distinguishing candidate regions potentially corresponding to the subject from candidate regions potentially corresponding to at least one other subject. In some embodiments, selecting one or more second regions from the subset of candidate regions to associate with the first region is based, at least in part, on estimated locations of a plurality of joints of the subject. In some embodiments, selecting one or more second regions from the subset of candidate regions to associate with the first region at least partially discounts an effect of unknown, invalid, or inaccurate depth information in the base depth data.

In some embodiments, the subject is a human being. In some embodiments, the first pose component and the at least one second pose component are body parts of the subject. In some embodiments, the first pose component is a torso of the subject and the at least one second pose component is a hand of the subject. In some embodiments, identifying the first pose component and at least one second pose component of the subject comprises detecting a portion of the collective region based, at least in part, on a measurement of depth. In some embodiments, detecting a portion of the collective region comprises detecting a portion closest in depth to the mobile platform. In some embodiments, identifying the at least one second pose component is based, at least in part, on a location of the detected portion relative to a grid system.

In some embodiments, identifying the at least one second pose component comprises identifying at least two second pose components. In some embodiments, identifying the first pose component comprises removing the identified at least one second pose component from the collective region. In some embodiments, identifying the first pose component is based, at least in part, on a width threshold.

In some embodiments, determining a spatial relationship between the identified first pose component and the identified at least one second pose component comprises determining one or more geometric attributes of the first and/or second pose components. In some embodiments, the one or more geometric attributes include a location of a centroid. In some embodiments, determining a spatial relationship between the identified first pose component and the identified at least one second pose component comprises determining one or more vectors pointing between portions of the first pose component and the at least one second pose component.

In some embodiments, the mobile platform includes at least one of an unmanned aerial vehicle (UAV), a manned aircraft, an autonomous car, a self-balancing vehicle, a robot, a smart wearable device, a virtual reality (VR) head-mounted display, or an augmented reality (AR) head-mounted display. In some embodiments, the method further comprises controlling a mobility function of the mobile platform based, at least in part, on the controlling command.

Any of the foregoing methods can be implemented via a non-transitory computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors associated with a mobile platform to perform corresponding actions, or via a vehicle including a programmed controller that at least partially controls one or more motions of the vehicle and that includes one or more processors configured to perform corresponding actions.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is an illustrative 2D image of an environment in front of a mobile platform and FIG. 1B is an illustrative depth map corresponding to the 2D image of FIG. 1A, in accordance with some embodiments of the presently disclosed technology.

FIG. 2A is an illustrative 2D image of a subject (e.g., human user) and FIG. 2B is a corresponding depth map with defects, in accordance with some embodiments of the presently disclosed technology.

FIG. 3 is a flowchart illustrating a method for determining a pose or gesture of a subject, in accordance with some embodiments of the presently disclosed technology.

FIG. 4A illustrates a 2D image of multiple subjects (e.g., two human dancers) and FIG. 4B illustrates subject features (e.g., joints) determined from the 2D image, as well as the grouping and mapping of the subject features to individual subjects, in accordance with some embodiments of the presently disclosed technology.

FIGS. 5A-5D illustrate a process for identifying a secondary pose component (e.g., hand) of a subject, in accordance with some embodiments of the presently disclosed technology.

FIGS. 6A and 6B illustrate a process for identifying a primary pose component (e.g., torso) of the subject, in accordance with some embodiments of the presently disclosed technology.

FIG. 7 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology.

FIG. 8 is a block diagram illustrating an example of the architecture for a computer system or other control device that can be utilized to implement various portions of the presently disclosed technology.

DETAILED DESCRIPTION

1. Overview

Poses or gestures are natural communication modes for human users, which provide a promising direction for a human-computer interface applicable to various mobile platforms. However, it can be technologically difficult and computationally intensive to determine the pose or gesture of subject(s) based on two-dimensional (2D) image information, and any resulting distance information may not be accurate. Depth information (e.g., provided by a stereo camera, LiDAR, and/or other sensors) provides measurements in another dimension, which can be used to improve the efficiency and/or efficacy of pose determination. However, inaccuracies or other imperfections in the depth information may occur for various reasons. For example, a stereo camera may not be able to provide depth information for a texture-less subject, some part(s) of the environment may be overexposed or underexposed, and certain shapes, orientations, or colocations of objects may cause imperfect depth detection.

The presently disclosed technology uses a region-based method to discount or eliminate the negative effect of imperfections in depth information, identify pose components of one or more subjects, and determine spatial relationships among the pose components for generating various controlling commands based thereon. An illustrative method can include analyzing depth information (e.g., including identifying depth inaccuracies or other imperfections) and determining candidate regions that potentially include at least a portion of a pose component (e.g., torso, hand, or the like). The candidate regions can be determined based on various depth-discontinuity criteria; some of the discontinuities may be caused by the depth inaccuracies or other imperfections. The illustrative method can include generating a collective region by grouping together a subset of the candidate regions based on their spatial or logical relationship, thereby “reconnecting” certain candidate regions that were separated due to the depth inaccuracies or imperfections. The collective region can be further analyzed to identify one or more pose components therein.

Several details describing structures and/or processes that are well-known and often associated with mobile platforms (e.g., UAVs or other types of movable platforms) and corresponding systems and subsystems, but that may unnecessarily obscure some significant aspects of the presently disclosed technology, are not set forth in the following description for purposes of clarity. Moreover, although the following disclosure sets forth several embodiments of different aspects of the presently disclosed technology, several other embodiments can have different configurations or different components than those described herein. Accordingly, the presently disclosed technology may have other embodiments with additional elements and/or without several of the elements described below with reference to FIGS. 1A-8.

FIGS. 1A-8 are provided to illustrate representative embodiments of the presently disclosed technology. Unless provided for otherwise, the drawings are not intended to limit the scope of the claims in the present application.

Many embodiments of the technology described below may take the form of computer- or controller-executable instructions, including routines executed by a programmable computer or controller. The programmable computer or controller may or may not reside on a corresponding scanning platform. For example, the programmable computer or controller can be an onboard computer of the mobile platform, or a separate but dedicated computer associated with the mobile platform, or part of a network- or cloud-based computing service. Those skilled in the relevant art will appreciate that the technology can be practiced on computer or controller systems other than those shown and described below. The technology can be embodied in a special-purpose computer or data processor that is specifically programmed, configured, or constructed to perform one or more of the computer-executable instructions described below. Accordingly, the terms “computer” and “controller” as generally used herein refer to any data processor and can include Internet appliances and handheld devices (including palm-top computers, wearable computers, cellular or mobile phones, multi-processor systems, processor-based or programmable consumer electronics, network computers, minicomputers, and the like). Information handled by these computers and controllers can be presented at any suitable display medium, including an LCD (liquid crystal display). Instructions for performing computer- or controller-executable tasks can be stored in or on any suitable computer-readable medium, including hardware, firmware, or a combination of hardware and firmware. Instructions can be contained in any suitable memory device, including, for example, a flash drive, USB (universal serial bus) device, and/or other suitable medium. In particular embodiments, the instructions are accordingly non-transitory.

2. Representative Embodiments

As discussed above, a stereo camera or other sensor can provide data for obtaining depth information (e.g., measurements of distance between different portions of a scene and the sensor) for an environment that surrounds or is otherwise adjacent to, but does not necessarily abut, a mobile platform. For example, FIG. 1A is an illustrative 2D image of an environment in front of a mobile platform and FIG. 1B is an illustrative depth map corresponding to the 2D image of FIG. 1A, in accordance with some embodiments of the presently disclosed technology. As shown in FIG. 1A, the environment includes a subject (e.g., a human user), various other objects, a floor, and walls. The depth map of FIG. 1B illustratively uses grayscale to represent the depth (e.g., distance to the mobile platform or the observing sensor) of different portions of the environment. For example, the darker a pixel appears in the depth map of FIG. 1B, the deeper (i.e., farther away) the portion of the environment corresponding to the pixel is located. In various embodiments, a depth map can be represented as a color graph, point cloud, or other forms.

Various imperfections, such as inaccuracies, defects, and/or unknown or invalid values, can exist in the depth information about an environment. For example, FIG. 2A is an illustrative 2D image of a subject (e.g., human user) and FIG. 2B is a corresponding depth map with defects, in accordance with some embodiments of the presently disclosed technology. As can be seen, the subject in FIG. 2A is extending an arm to express a pose/gesture. A portion 202 of the extended arm lacks texture (e.g., appears mostly in white, without variance). In this case, the stereo-camera-generated depth map of FIG. 2B cannot accurately reflect the depth of the arm portion 202. Illustratively, in FIG. 2B, a region 204 that corresponds to the arm portion 202 includes unknown or invalid depth data, thereby incorrectly separating a torso portion 206 from a hand portion 208, both of which in fact belong to the same subject. A person of skill in the art can appreciate that various other forms of imperfections may exist in sensor-generated depth information about an environment.

FIG. 3 is a flowchart illustrating a method 300 for determining a pose or gesture of a subject despite such imperfections, in accordance with some embodiments of the presently disclosed technology. The method can be implemented by a controller (e.g., an onboard computer of a mobile platform, an associated computing device, and/or an associated computing service). The method 300 can use various combinations of depth data, non-depth data, and/or baseline information at different stages to achieve pose/gesture determinations. As discussed above, the method 300 can include analyzing depth information, determining candidate regions, and generating a collective region for identifying one or more pose components therein.

At block 305, the method includes obtaining depth data of an environment (e.g., adjacent to the mobile platform) including one or more subjects (e.g., human users). The depth data can be generated based on images captured by one or more stereo cameras, point clouds provided by one or more LiDAR sensors, and/or data produced by other sensors carried by the mobile platform. In the embodiments in which stereo camera(s) are used, the depth data can be represented as a series of time-sequenced depth maps that are calculated based on corresponding disparity maps and intrinsic parameters of the stereo camera(s). In some embodiments, median filters or other pre-processing operations can be applied to the source data (e.g., disparity maps) before it is used as a basis for generating the depth data, in order to reduce noise or otherwise improve data quality.
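
As an illustrative sketch only (in Python, with NumPy-style arrays; the function name, camera parameters, and filter size are our assumptions, not values from this disclosure), the depth computation at block 305 can follow the standard stereo relation depth = focal length × baseline / disparity, with a median pre-filter applied to the source disparity map:

```python
import numpy as np
from scipy.ndimage import median_filter

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Convert a stereo disparity map (pixels) into a depth map (meters)
    using the pinhole-stereo relation depth = f * B / d. Pixels with
    zero or negative disparity are marked invalid (NaN), mirroring the
    unknown/invalid depth entries discussed in the text."""
    # Pre-process the source disparity map to reduce noise; a 3x3
    # median filter is one common (assumed) choice.
    disparity = median_filter(disparity, size=3)
    depth = np.full(disparity.shape, np.nan, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative (assumed) parameters and synthetic data:
rng = np.random.default_rng(0)
disp = rng.uniform(0.0, 64.0, size=(120, 160))
depth_map = depth_from_disparity(disp, focal_px=400.0, baseline_m=0.12)
```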

In some embodiments, the method includes determining a depth range in which one or more subjects are likely to appear in the depth data corresponding to a particular time (e.g., a most recently generated frame of a depth map). Illustratively, the depth range can be determined based on baseline information derived from prior depth data (e.g., one or more frames of depth maps generated immediately prior to the most recent frame). The baseline information can include a depth location (e.g., 2 meters in front of the mobile platform) of a subject that was determined (e.g., using the method 300 in a prior iteration) based on the prior depth data. The depth range (e.g., a range between 1.5 meters and 2.5 meters) can be set to cover the determined depth location with certain margins. In some embodiments, the baseline information can be defined independent of the existence of prior depth data. For example, an initial iteration of the method 300 can use a predefined depth range (e.g., a range between 1 meter and 3.5 meters in front of the mobile platform). The controller can generate a reference depth map (e.g., FIG. 2B) using only the depth information within the depth range. In some embodiments, the depth range determination is not performed and the reference depth map can include all or a subset of the depth information corresponding to the particular time.
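
A minimal sketch of this range filtering, assuming NaN marks unknown depth and reusing the 1.5-2.5 meter example range from the text (the function name is ours):

```python
import numpy as np

def reference_depth_map(current, d_min, d_max):
    """Keep only depth values inside [d_min, d_max]; everything else
    becomes NaN (unknown), yielding the reference depth map."""
    ref = current.copy()
    ref[~((current >= d_min) & (current <= d_max))] = np.nan
    return ref

# E.g., a subject previously located about 2 m away, with 0.5 m margins:
# ref = reference_depth_map(depth_map, 1.5, 2.5)
```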

At block 310, the method includes identifying candidate regions that may correspond to pose components (e.g., the user's or other subject's torso, hand, or the like). Illustratively, the controller analyzes the reference depth map to identify all the regions that can be separated from one another based on one or more depth connectivity criteria (e.g., a depth threshold or a change-of-depth threshold). For example, each identified candidate region includes pixels that are connected to one another on the depth map. The pixels in each region do not vary in depth beyond a certain threshold. Each candidate region is disconnected from any other candidate regions in depth, and in some embodiments, also disconnected in one or two other dimensions. As discussed previously, in some cases, the disconnection is caused by unknown, invalid, or inaccurate depth information.
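
A sketch of one way to perform this region identification, using 4-connectivity and an assumed change-of-depth threshold of 0.05 m (both choices are illustrative, not specified by the disclosure):

```python
import numpy as np
from collections import deque

def candidate_regions(depth, max_step=0.05):
    """Label 4-connected regions whose neighboring pixels differ in depth
    by no more than max_step meters. NaN (unknown/invalid) pixels break
    connectivity, which is how imperfect depth data can split a single
    subject into several candidate regions."""
    h, w = depth.shape
    labels = np.zeros((h, w), dtype=np.int32)  # 0 = unlabeled
    n_labels = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] or np.isnan(depth[sy, sx]):
                continue
            n_labels += 1
            labels[sy, sx] = n_labels
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not labels[ny, nx]
                            and not np.isnan(depth[ny, nx])
                            and abs(depth[ny, nx] - depth[y, x]) <= max_step):
                        labels[ny, nx] = n_labels
                        queue.append((ny, nx))
    return labels, n_labels
```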

The controller can select a subset of the candidate regions based on baseline information regarding at least one pose component of the subject. The baseline information can indicate an estimated size of pose component(s) (e.g., size of a hand), based on prior depth data. Illustratively, the controller can estimate a quantity n_(valid) of pixels on the reference depth map corresponding to an average size of a hand, using (1) the likely depth location (e.g., 2 meters in front of the mobile platform) of the subject as estimated at block 305 and (2) a previously determined relationship between a pixel quantity (e.g., 100 pixels) and an average hand size at a known depth location (e.g., 1 meter in front of the mobile platform). In this example, because apparent pixel area scales with the inverse square of distance, the estimated pixel quantity n_(valid) can be 25 pixels. In some embodiments, a margin (e.g., 20%) or other weighting factor can be applied to reduce the risk of false negatives; thus the estimated pixel quantity n_(valid) can be further reduced, for example, to 20 pixels. Accordingly, potentially more candidate regions can be selected and included in the subset.

The estimated size of the pose component(s) can be used to filter the candidate regions to generate a more relevant subset for identifying pose component(s) in a more efficient manner. Illustratively, the controller compares the size (e.g., number of pixels) of each candidate region with a valid size range (e.g., between n_(valid) and a multiple of n_(valid)) defined based on the estimated size, and filters out candidate regions having sizes that fall outside of the valid size range. In some embodiments, the valid size range can be determined based on estimated sizes of two or more pose components.
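
The numeric example above can be reproduced directly: apparent pixel area scales with the inverse square of distance, so 100 pixels at 1 meter becomes 100 × (1/2)² = 25 pixels at 2 meters, and a 20% margin lowers n_(valid) to 20. A sketch of this scaling and the size filter follows (the multiple k of n_(valid) is an assumed value):

```python
import numpy as np

def estimated_hand_pixels(ref_pixels, ref_depth_m, subject_depth_m, margin=0.2):
    """Scale a known pixel count (ref_pixels at ref_depth_m) to the
    subject's estimated depth: 100 px at 1 m -> 100 * (1/2)**2 = 25 px
    at 2 m, then a 20% margin reduces n_valid to 20 px."""
    n = ref_pixels * (ref_depth_m / subject_depth_m) ** 2
    return n * (1.0 - margin)

def select_subset(labels, n_labels, n_valid, k=6):
    """Keep candidate regions whose pixel count lies within the valid
    size range [n_valid, k * n_valid]."""
    sizes = np.bincount(labels.ravel(), minlength=n_labels + 1)
    return [lab for lab in range(1, n_labels + 1)
            if n_valid <= sizes[lab] <= k * n_valid]

assert estimated_hand_pixels(100, 1.0, 2.0, margin=0.2) == 20.0
```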

At block 315, the method includes selecting from the subset of candidate regions a primary region that potentially corresponds to a primary pose component (e.g., torso) of a subject. In some embodiments, the controller determines the primary region based on baseline information (e.g., location and/or size) regarding the primary pose component of the subject. Similarly, the baseline information used here can be derived from prior depth data (e.g., prior reference depth map(s) in which the primary pose component was determined using the method 300 in prior iteration(s)). For example, if the subject's torso was detected from a prior frame of depth information, the controller can determine a centroid point of the previously determined torso portion and map the centroid point to the current reference depth map. The controller can label a candidate region as the primary region if the centroid point lands inside the candidate region. Alternatively, the controller can select a primary region based on region size (e.g., selecting the largest candidate region within the subset as corresponding to the torso of the subject). It should be noted that the primary region may include a portion of the primary pose component, multiple pose components, non-pose components (e.g., the head of a human user in some cases) of the subject, and/or objects other than the subject.
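
A compact sketch of both selection strategies described above (centroid carry-over from the prior frame, with a largest-region fallback; the function and argument names are ours):

```python
import numpy as np

def primary_region(labels, subset, prior_centroid=None):
    """Pick the primary (e.g., torso) region: the candidate region that
    contains the torso centroid mapped from a prior frame if one is
    available, otherwise the largest region in the subset."""
    if prior_centroid is not None:
        y, x = prior_centroid
        lab = int(labels[int(round(y)), int(round(x))])
        if lab in subset:
            return lab
    sizes = np.bincount(labels.ravel())
    return max(subset, key=lambda lab: sizes[lab])
```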

At block 320, the method includes generating a collective region by associating the primary region with one or more secondary regions of the subset based on their relative locations. In some embodiments, this can be achieved based on an analysis of non-depth information regarding the environment. As discussed previously, candidate regions can be disconnected from one another due to unknown, invalid, or inaccurate depth information. Non-depth information (e.g., corresponding 2D image(s) of the environment) can be used to re-connect or otherwise associate disconnected regions. Illustratively, joints or other applicable subject features can be determined from 2D images. Based on the spatial and/or logical relationship among the determined subject features, they can be grouped in an ordered fashion and/or mapped to corresponding subject(s). The grouping and/or mapping can be based on baseline information regarding the subject features (e.g., the wrist joint and shoulder joint must be connected via an elbow joint). In some embodiments, the controller can implement a Part Affinity Fields (PAFs)-based method to achieve the feature grouping and/or mapping. For example, FIG. 4A illustrates a 2D image of multiple subjects (e.g., two human dancers), and FIG. 4B illustrates subject features (e.g., joints) determined from the 2D image, as well as the grouping and mapping of subject features to individual subjects, in accordance with some embodiments of the presently disclosed technology. As illustrated in FIG. 4B, the grouping and mapping of subject features can distinguish the two subjects from each other, and thus avoid incorrectly mapping feature(s) of a first subject to a second subject. The controller can then project the subject features onto the reference depth map and associate the primary region (e.g., one that includes one or more features of a subject) with one or more secondary regions (e.g., each including one or more features of the same subject) in accordance with the subject feature grouping and/or mapping.

Because analyzing non-depth information may consume a considerable amount of computational resources, in some embodiments the controller generates the collective region based on the distance between, and/or the relative sizes of, the primary region and other candidate regions. For example, the controller can select a threshold number of secondary regions that (1) have a size within the valid size range and (2) are located close enough to the primary region to be associated with the primary region. Once the collective region, including the primary and secondary regions, is constructed, other candidate regions can be filtered out or otherwise excluded from further processing.
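
A sketch of this cheaper, depth-map-only association, with assumed values for the centroid-distance threshold and the maximum number of secondary regions:

```python
import numpy as np

def collective_region(labels, subset, primary, max_gap_px=40, max_secondary=3):
    """Associate the primary region with up to max_secondary nearby
    regions from the subset, using 2D centroid distance as the
    proximity test; returns a boolean mask of the collective region."""
    def centroid(lab):
        ys, xs = np.nonzero(labels == lab)
        return np.array([ys.mean(), xs.mean()])

    pc = centroid(primary)
    near = sorted((lab for lab in subset if lab != primary
                   and np.linalg.norm(centroid(lab) - pc) <= max_gap_px),
                  key=lambda lab: np.linalg.norm(centroid(lab) - pc))
    return np.isin(labels, [primary] + near[:max_secondary])
```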

At block 325, the method includes identifying at least a primary pose component (e.g., torso) and a secondary pose component (e.g., hand) of the same subject from the collective region. Illustratively, the controller analyzes a distribution of depth measurements using baseline information regarding the pose component(s).

For example, FIGS. 5A-5D illustrate a process for identifying a secondary pose component (e.g., hand) of a subject (e.g., a human user), in accordance with some embodiments of the presently disclosed technology. For purposes of illustration, FIGS. 5A-5D do not show depth information (e.g., in grayscale). The baseline information regarding pose component(s) can indicate that a hand of a subject is likely to be a pose component that has a shortest distance to the mobile platform. Therefore, as illustrated in FIG. 5A, the controller can search for and identify a closest point 502 within the collective region 510. With reference to FIG. 5B, using the closest point as a seed point, the controller can identify a portion 504 of the collective region, for example, by applying a flood-fill method that is initiated at the closest point 502. Various other applicable methods can be employed to identify the portion 504 based on the closest point 502, and various depth thresholds or change-of-depth thresholds can be used therein. The identified portion 504 can include a hand of the subject and potentially other body part(s) (e.g., arm).
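
A sketch of the seeded flood fill of FIGS. 5A-5B: start from the collective region's closest (smallest-depth) point and grow while neighboring depths stay within an assumed change-of-depth threshold:

```python
import numpy as np
from collections import deque

def hand_portion(depth, collective_mask, max_step=0.05):
    """Approximate portion 504: flood-fill the collective region from
    its closest point (the smallest depth value), stopping where depth
    changes between neighbors exceed max_step meters."""
    masked = np.where(collective_mask, depth, np.nan)
    seed = np.unravel_index(np.nanargmin(masked), masked.shape)
    h, w = depth.shape
    portion = np.zeros((h, w), dtype=bool)
    portion[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and collective_mask[ny, nx]
                    and not portion[ny, nx]
                    and abs(masked[ny, nx] - masked[y, x]) <= max_step):
                portion[ny, nx] = True
                queue.append((ny, nx))
    return portion, seed
```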

In some embodiments, the controller can divide or partition the reference depth map in order to determine the secondary pose component in a more efficient and/or accurate manner. For example, a grid system can be used to further determine the hand of the subject within the identified portion 504. As illustrated in FIG. 5C, an example grid system 520 can divide the reference depth map into 9 grid blocks 531-539. The gridlines of the grid system 520 can be defined based on baseline information regarding the primary pose component (e.g., torso) or other pose components of the subject, and may themselves be non-uniform. For example, the upper horizontal gridline 522 can be located at a height of a torso centroid estimated from one or more previous frames of depth data; the lower horizontal gridline 524 can be located at a margin distance from the upper horizontal gridline 522; and the left and right vertical gridlines 526, 528 can be defined based on a torso width (with certain margin value) estimated from one or more previous frames of depth data.
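
A sketch of one possible gridline construction; the row-major numbering of blocks 531-539 and the margin value are our assumptions, since the exact layout of FIG. 5C is not reproduced here:

```python
def grid_block(point, torso_centroid, torso_half_width, margin_px=20):
    """Classify a point (y, x) into one of nine grid blocks, numbered
    531-539 row-major (an assumption for illustration). The upper
    horizontal gridline sits at the prior torso-centroid height, the
    lower one a margin below it, and the vertical gridlines bracket
    the estimated torso width around the torso centroid."""
    y, x = point
    cy, cx = torso_centroid
    top, bottom = cy, cy + margin_px
    left, right = cx - torso_half_width, cx + torso_half_width
    row = 0 if y < top else (1 if y < bottom else 2)
    col = 0 if x < left else (1 if x < right else 2)
    return 531 + row * 3 + col
```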

The gridlines form various grid blocks that can be processed individually (in sequence or in parallel) based on the grid block's location, size, or other attributes. Illustratively, the controller can scan the identified portion 504 in a direction or manner based on the location of the grid block into which the portion 504 falls. For example, if the identified portion 504 falls into any of grid blocks 531-536 or 538, it is more likely that the hand of the subject is pointing upwards. Therefore, at least for purposes of computational efficiency, the controller can scan the identified portion 504, pixel row by pixel row, in an up-down direction. As another example, if the identified portion falls into either grid block 537 or 539, it is more likely that the hand of the subject is pointing left or pointing right. Therefore, the controller can scan the identified portion, pixel column by pixel column, in a left-right or right-left direction, respectively. As illustrated in FIG. 5D, an estimated boundary 542 of the hand can be located where an increase in depth between two or more adjacent pixel rows (or columns) exceeds a threshold. The set of scanned pixels 540 within the estimated hand boundary 542 can be identified as corresponding to a hand of the subject.
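
A sketch of the up-down scan for an upward-pointing hand, cutting at an assumed depth-jump threshold as with boundary 542 in FIG. 5D (a left-right scan would transpose the same logic to columns):

```python
import numpy as np

def hand_by_row_scan(depth, portion, jump=0.08):
    """Scan the identified portion row by row, top-down, and stop where
    the mean depth of adjacent occupied rows increases by more than
    `jump` meters; the scanned rows above the cut form the hand."""
    hand = np.zeros_like(portion)
    prev = None
    for y in np.where(portion.any(axis=1))[0]:
        mean_d = np.nanmean(np.where(portion[y], depth[y], np.nan))
        if prev is not None and mean_d - prev > jump:
            break  # estimated hand boundary reached
        hand[y] = portion[y]
        prev = mean_d
    return hand
```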

In some embodiments, the hand identification is further corroborated or revised based on 2D image data that corresponds to the reference depth map. For example, the hand boundaries can be revised based on texture and/or contrast analysis of the 2D image data. In some embodiments, the controller removes the identified first hand from the collective region and repeats the process of FIGS. 5A-5D to identify the second hand of the subject.

FIGS. 6A and 6B illustrate a process for identifying a primary pose component (e.g., torso) of the subject, in accordance with some embodiments of the presently disclosed technology. The collective region may include multiple pose components of the subject, non-pose components (e.g., head or legs in certain cases) of the subject, and/or objects other than the subject. Accordingly, the controller further processes the collective region to particularly identify the primary pose component. Illustratively, the controller can remove the identified secondary pose component(s) (e.g., hand(s)) and analyze the rest of the collective region 620. For example, the controller can efficiently identify the torso of the subject based on widths at various heights of the remaining collective region 620. As illustrated in FIG. 6A, the remaining collective region 620 can be examined line by line at different heights. The controller can compare widths of the collective region at different heights to a torso width range (e.g., between 2 and 6 times the width of the hand) and identify the subject's torso based thereon. With reference to FIG. 6A, an example line 602 indicates a width 604 that is outside the torso width range (e.g., smaller than twice the hand width), while another example line 606 indicates a width 608 that falls within the torso width range. Accordingly, as illustrated in FIG. 6B, the controller identifies a larger rectangular area 610 as the torso of the subject. The area 610 excludes the upper half of the subject's head (where line 602 resides).
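
A sketch of the line-by-line width test, reusing the 2x-6x hand-width range given above (the boolean-mask representation is our assumption):

```python
import numpy as np

def torso_rows(region_mask, hand_width_px, lo=2.0, hi=6.0):
    """Examine the remaining collective region line by line and keep
    only rows whose width falls within the torso width range (between
    lo and hi hand-widths), as with lines 602/606 in FIG. 6A."""
    widths = region_mask.sum(axis=1)
    keep = (widths >= lo * hand_width_px) & (widths <= hi * hand_width_px)
    torso = np.zeros_like(region_mask)
    torso[keep] = region_mask[keep]
    return torso
```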

At block 330, the method includes determining a spatial relationship between at least the identified primary pose component and secondary pose component(s). The controller can determine one or more geometric attributes (e.g., the centroid, contour, shape, or the like) of the primary and/or secondary pose components. The controller can determine one or more vectors pointing between portions of the primary pose component and the secondary pose component(s). At block 335, the method includes generating one or more commands based on the determined spatial relationship. Illustratively, the determined geometric attributes and/or the vectors can be used to define a direction, speed, acceleration, and/or rotation, which serves as a basis for generating command(s) for controlling the mobile platform's next move(s). For example, the controller can generate a command that causes the mobile platform to stop moving when a user's hand is raised. As another example, the controller can generate a command that causes the mobile platform to move toward a point in space that resides on a line defined by the centroids of the identified torso and hand of the subject. In various embodiments, the method 300 can be implemented in response to obtaining each frame of depth data, in response to certain events (e.g., the mobile platform detecting the presence of one or more subjects), and/or based on user commands.
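
A minimal sketch of the centroid-based vector at blocks 330-335, assuming boolean masks for the identified torso and hand; how the vector maps to a concrete command is platform-specific and left abstract:

```python
import numpy as np

def pose_vector(depth, torso_mask, hand_mask):
    """Compute 3D centroids (row, column, depth) of the identified torso
    and hand, and return the vector pointing from torso to hand, which
    can seed a direction/speed command for the mobile platform."""
    def centroid3(mask):
        ys, xs = np.nonzero(mask)
        return np.array([ys.mean(), xs.mean(), np.nanmean(depth[mask])])

    return centroid3(hand_mask) - centroid3(torso_mask)

# E.g., issue a "stop" command when the vector points sharply upward
# (hand raised), or move along the torso-to-hand line otherwise.
```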

FIG. 7 illustrates examples of mobile platforms configured in accordance with various embodiments of the presently disclosed technology. As illustrated, a representative mobile platform as disclosed herein may include at least one of an unmanned aerial vehicle (UAV) 702, a manned aircraft 704, an autonomous car 706, a self-balancing vehicle 708, a terrestrial robot 710, a smart wearable device 712, a virtual reality (VR) head-mounted display 714, or an augmented reality (AR) head-mounted display 716.

FIG. 8 is a block diagram illustrating an example of the architecture for a computer system or other control device 800 that can be utilized to implement various portions of the presently disclosed technology. In FIG. 8, the computer system 800 includes one or more processors 805 and memory 810 connected via an interconnect 825. The interconnect 825 may represent any one or more separate physical buses, point-to-point connections, or both, connected by appropriate bridges, adapters, or controllers. The interconnect 825, therefore, may include, for example, a system bus, a Peripheral Component Interconnect (PCI) bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), an IIC (I2C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, sometimes referred to as “Firewire.”

The processor(s) 805 may include central processing units (CPUs) to control the overall operation of, for example, the host computer. In certain embodiments, the processor(s) 805 accomplish this by executing software or firmware stored in memory 810. The processor(s) 805 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application-specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices.

The memory 810 can be or include the main memory of the computer system. The memory 810 represents any suitable form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. In use, the memory 810 may contain, among other things, a set of machine instructions which, when executed by the processor 805, causes the processor 805 to perform operations to implement embodiments of the presently disclosed technology.

Also connected to the processor(s) 805 through the interconnect 825 is an (optional) network adapter 815. The network adapter 815 provides the computer system 800 with the ability to communicate with remote devices, such as storage clients and/or other storage servers, and may be, for example, an Ethernet adapter or Fibre Channel adapter.

The techniques described herein can be implemented by, for example, programmable circuitry (e.g., one or more microprocessors) programmed with software and/or firmware, or entirely in special-purpose hardwired circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.

Software or firmware for use in implementing the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “machine-readable storage medium,” as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, or any device with one or more processors, etc.). For example, a machine-accessible storage medium includes recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.).

The term “logic,” as used herein, can include, for example, programmable circuitry programmed with specific software and/or firmware, special-purpose hardwired circuitry, or a combination thereof.

Some embodiments of the disclosure have other aspects, elements, features, and/or steps in addition to or in place of what is described above. These potential additions and replacements are described throughout the rest of the specification. Reference in this specification to “various embodiments,” “certain embodiments,” or “some embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. These embodiments, even alternative embodiments (e.g., referenced as “other embodiments”), are not mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not other embodiments. For example, some embodiments use depth information generated from stereo camera(s), while other embodiments can use depth information generated from LiDAR(s), 3D-ToF sensors, or RGB-D cameras. Still further embodiments can use depth information generated from a combination of sensors.

To the extent any materials incorporated by reference herein conflict with the present disclosure, the present disclosure controls.

I/We claim:
1. A method comprising: determining a depth range where a subject is likely to appear in a current depth map of an environment based, at least in part, on one or more previous depth maps of the environment; filtering the current depth map based, at least in part, on the depth range, to generate a reference depth map; identifying a plurality of candidate regions from the reference depth map, a depth change of each of the plurality of candidate regions not exceeding a threshold, and the plurality of candidate regions being disconnected from each other; selecting a subset of the plurality of candidate regions, a size of each candidate region in the subset of the plurality of candidate regions being within a threshold range, and the threshold range being determined based on an estimated size of a first pose component of the subject; determining a main region from the subset of the plurality of candidate regions based, at least in part, on a position or a size corresponding to a second pose component of the subject; generating a collective region by associating the main region with one or more target regions based, at least in part, on a relative position between the main region and the one or more target regions of the subset of the plurality of candidate regions, the one or more target regions being likely to correspond to one or more parts of the subject; identifying the first pose component of the subject from the collective region; identifying the second pose component of the subject from the collective region after the first pose component is identified from the collective region; determining one or more vectors representing a spatial relationship between the identified first pose component and the identified second pose component; and controlling a movement of a movable object based, at least in part, on the one or more vectors.
2. The method of claim 1, further comprising: generating depth data based, at least in part, on obtained images captured by a stereo camera carried by the movable object.
3. The method of claim 2, wherein: the depth data includes the current depth map calculated based on a disparity map or intrinsic parameters of the stereo camera.
4. The method of claim 2, wherein the depth data includes at least one of unknown, invalid, or inaccurate depth information.
5. The method of claim 1, wherein the movable object includes at least one of an unmanned aerial vehicle (UAV), a manned aircraft, an autonomous car, a self-balancing vehicle, a robot, a smart wearable device, a virtual reality (VR) head-mounted display, or an augmented reality (AR) head-mounted display.
6. The method of claim 1, wherein the subject includes a human.
7. The method of claim 1, wherein: the first pose component includes one of the one or more body parts of the subject; and the second pose component includes another one of the one or more body parts of the subject.
8. The method of claim 1, wherein: the first pose component includes a hand of the subject; and the second pose component includes a torso of the subject.
9. The method of claim 1, wherein: identifying the first pose component and the second pose component of the subject from the collective region includes detecting a portion of the collective region based, at least in part, on a measurement of depth.
10. The method of claim 9, wherein: detecting the portion of the collective region includes detecting a portion of the collective region that is closest in depth to the movable object.
11. A movable object comprising: a controller programmed to control the movable object, wherein the controller includes one or more processors configured to: determine a depth range where a subject is likely to appear in a current depth map of an environment based, at least in part, on one or more previous depth maps of the environment; filter the current depth map based, at least in part, on the depth range, to generate a reference depth map; identify a plurality of candidate regions from the reference depth map, a depth change of each of the plurality of candidate regions not exceeding a threshold, and the plurality of candidate regions being disconnected from each other; select a subset of the plurality of candidate regions, a size of each candidate region in the subset of the plurality of candidate regions being within a threshold range, and the threshold range being determined based on an estimated size of a first pose component of the subject; determine a main region from the subset of the plurality of candidate regions based, at least in part, on a position or a size corresponding to a second pose component of the subject; generate a collective region by associating the main region with one or more target regions based, at least in part, on a relative position between the main region and the one or more target regions of the subset of the plurality of candidate regions, the one or more target regions being likely to correspond to one or more parts of the subject; identify the first pose component of the subject from the collective region; identify the second pose component of the subject from the collective region after the first pose component is identified from the collective region; determine one or more vectors representing a spatial relationship between the identified first pose component and the identified second pose component; and control a movement of the movable object based, at least in part, on the one or more vectors.
12. The movable object of claim 11, further comprising: a stereo camera; wherein the one or more processors are further configured to generate depth data based, at least in part, on obtained images captured by the stereo camera.
13. The movable object of claim 12, wherein: the depth data includes the current depth map calculated based on a disparity map or intrinsic parameters of the stereo camera.
14. The movable object of claim 12, wherein the depth data includes at least one of unknown, invalid, or inaccurate depth information.
15. The movable object of claim 11, wherein the movable object includes at least one of an unmanned aerial vehicle (UAV), a manned aircraft, an autonomous car, a self-balancing vehicle, a robot, a smart wearable device, a virtual reality (VR) head-mounted display, or an augmented reality (AR) head-mounted display.
16. The movable object of claim 11, wherein: the first pose component includes one of the one or more body parts of the subject; and the second pose component includes another one of the one or more body parts of the subject.
17. The movable object of claim 11, wherein: the first pose component includes a hand of the subject; and the second pose component includes a torso of the subject.
18. The movable object of claim 11, wherein: identifying the first pose component and the second pose component of the subject from the collective region includes detecting a portion of the collective region based, at least in part, on a measurement of depth.
19. The movable object of claim 18, wherein: detecting the portion of the collective region includes detecting a portion of the collective region that is closest in depth to the movable object.
20. A non-transitory computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors associated with a movable object to perform actions, the actions comprising: determining a depth range where a subject is likely to appear in a current depth map of an environment based, at least in part, on one or more previous depth maps of the environment; filtering the current depth map based, at least in part, on the depth range, to generate a reference depth map; identifying a plurality of candidate regions from the reference depth map, a depth change of each of the plurality of candidate regions not exceeding a threshold, and the plurality of candidate regions being disconnected from each other; selecting a subset of the plurality of candidate regions, a size of each candidate region in the subset of the plurality of candidate regions being within a threshold range, and the threshold range being determined based on an estimated size of a first pose component of the subject; determining a main region from the subset of the plurality of candidate regions based, at least in part, on a position or a size corresponding to a second pose component of the subject; generating a collective region by associating the main region with one or more target regions based, at least in part, on a relative position between the main region and the one or more target regions of the subset of the plurality of candidate regions, the one or more target regions being likely to correspond to one or more parts of the subject; identifying the first pose component of the subject from the collective region; identifying the second pose component of the subject from the collective region after the first pose component is identified from the collective region; determining one or more vectors representing a spatial relationship between the identified first pose component and the identified second pose component; and controlling a movement of the movable object based, at least in part, on the one or more vectors.